 binary code


Transforming Generic Coder LLMs to Effective Binary Code Embedding Models for Similarity Detection

Neural Information Processing Systems

Cybersecurity and software research have intersected with modern deep learning research for several years. The power of large language models (LLMs) in particular has motivated us to apply them to understanding binary code. In this paper, we investigate several ways LLMs can be applied to binary code similarity detection, a significantly more difficult task than source code similarity detection because of the sparsity of information and less meaningful syntax. The task also has great practical implications, such as vulnerability and malware detection. We find that pretrained LLMs are mostly capable of detecting similar binary code, even in a zero-shot setting. Our main contribution is a set of supervised fine-tuning methods that, when combined, significantly surpass zero-shot LLMs and state-of-the-art binary code similarity detection methods.
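As a rough illustration of how an embedding model is typically used for similarity detection, the sketch below ranks candidate functions by cosine similarity to a query embedding. It is not the paper's method: the vectors are hand-written stand-ins for model outputs, and the names `cosine_similarity`, `rank_by_similarity`, and the function labels are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Hypothetical embeddings; a real system would obtain these from an
# embedding model applied to disassembled binary functions.
query = [0.9, 0.1, 0.3]
candidates = {
    "func_a_O2": [0.8, 0.2, 0.35],
    "func_b": [-0.5, 0.9, 0.0],
    "func_c": [0.1, 0.1, -0.9],
}

ranking = rank_by_similarity(query, candidates)
print(ranking)
```

In this setup, fine-tuning would aim to move embeddings of functions compiled from the same source closer together, so that the true match ranks first.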



Optimizing affinity-based binary hashing using auxiliary coordinates

Neural Information Processing Systems

In supervised binary hashing, one wants to learn a function that maps a high-dimensional feature vector to a vector of binary codes, for application to fast image retrieval. This typically results in a difficult optimization problem, nonconvex and nonsmooth, because of the discrete variables involved. Much work has simply relaxed the problem during training, solving a continuous optimization, and truncating the codes a posteriori. This gives reasonable results but is quite suboptimal. Recent work has tried to optimize the objective directly over the binary codes and achieved better results, but the hash function was still learned a posteriori, which remains suboptimal. We propose a general framework for learning hash functions using affinity-based loss functions that uses auxiliary coordinates. This closes the loop and optimizes jointly over the hash functions and the binary codes so that they gradually match each other. The resulting algorithm can be seen as an iterated version of the procedure of optimizing first over the codes and then learning the hash function. Compared to this, our optimization is guaranteed to obtain better hash functions while being not much slower, as demonstrated experimentally in various supervised datasets.
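To make the setting concrete, the sketch below shows what a learned hash function is used for: a thresholded linear map turns a feature vector into a short binary code, and retrieval ranks database items by Hamming distance to the query code. It does not implement the joint optimization described above; the projection matrix `W`, the feature vectors, and the item names are hypothetical values chosen for illustration.

```python
def hash_code(W, x):
    """Thresholded linear hash function: one bit per row of W."""
    return [1 if sum(w * xi for w, xi in zip(row, x)) > 0 else 0 for row in W]


def hamming(a, b):
    """Number of positions at which two equal-length binary codes differ."""
    return sum(p != q for p, q in zip(a, b))


def retrieve(W, query, database):
    """Return database item names sorted by Hamming distance to the query code."""
    q_code = hash_code(W, query)
    distances = {name: hamming(q_code, hash_code(W, x)) for name, x in database.items()}
    return sorted(distances, key=distances.get)


# Hypothetical projection matrix (4 bits, 3 input features). In practice W
# would be learned, for example by the relax-then-truncate baseline or by a
# joint optimization over codes and hash function.
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, -1.0, 0.0],
]

# Hypothetical feature vectors standing in for image descriptors.
database = {
    "img_cat": [0.9, 0.2, -0.4],
    "img_dog": [0.2, 0.6, -0.5],
    "img_car": [-0.7, -0.1, 0.6],
}

query = [1.0, 0.1, -0.3]
results = retrieve(W, query, database)
print(hash_code(W, query), results)
```

Because comparison happens on short bit vectors, retrieval is fast; the quality of the ranking depends entirely on how well the learned hash function preserves the supervised affinities.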


IDEA: An Invariant Perspective for Efficient Domain Adaptive Image Retrieval

Neural Information Processing Systems

More importantly, we employ a generative model for synthetic samples to simulate the intervention of various non-causal effects, thereby minimizing their impact on hash codes for domain invariance. Comprehensive experiments conducted on benchmark datasets confirm the superior performance of our proposed IDEA compared to a variety of competitive baselines.